In the following exercises, we will use the data you collected in the previous session (all comments for the video “The Census” by Last Week Tonight with John Oliver). You might have to adjust the following code to use the correct file path on your computer.

comments <- readRDS("../data/LWT_Census_parsed.rds")
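
Before preprocessing, it can help to confirm the import worked. A minimal sanity check using only base R (a sketch; it assumes `comments` is a data frame with one row per comment and, as used below, a `TextEmojiDeleted` column):

comments <- readRDS("../data/LWT_Census_parsed.rds")
nrow(comments)    # number of comments
str(comments)     # column names and types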

Next, we go through the preprocessing steps described in the slides. In the first step, we remove newline characters from the comment strings without emojis (the TextEmojiDeleted column).

library(tidyverse)

# replace newline characters with spaces
comments <- comments %>% 
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\n",
                                            replacement = " "))
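
As an optional check, we can count how many comments still contain a newline character; if the replacement worked, this should be zero (a sketch using stringr, which is loaded with the tidyverse):

# number of comments that still contain a newline character
sum(str_detect(comments$TextEmojiDeleted, "\n"))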

Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.

library(quanteda)

toks <- comments %>% 
  pull(TextEmojiDeleted) %>% 
  char_tolower() %>% 
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         split_hyphens = TRUE, # replaces the deprecated remove_hyphens argument
         remove_url = TRUE)

# create a document-feature matrix and remove English stopwords
# (the remove argument of dfm() is deprecated in newer quanteda versions)
comments_dfm <- dfm(toks) %>% 
  dfm_remove(pattern = quanteda::stopwords("english"))
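
To get a feel for the resulting document-feature matrix, we can inspect its dimensions (a sketch using quanteda's accessor functions):

ndoc(comments_dfm)   # number of documents (comments)
nfeat(comments_dfm)  # number of unique features (terms) after stopword removal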

Exercise 1

Which are the 20 most frequently used words in the comments on the video “The Census” by Last Week Tonight with John Oliver? Save the overall word ranking in a new object called term_freq.
You can use the function textstat_frequency() to answer this question (it comes from the quanteda package; in quanteda v3 and later, it lives in the companion package quanteda.textstats).
term_freq <- textstat_frequency(comments_dfm)
head(term_freq, 20)
##       feature frequency rank docfreq group
## 1      census      1763    1    1340   all
## 2      people       991    2     728   all
## 3        just       752    3     654   all
## 4        like       619    4     526   all
## 5         one       520    5     432   all
## 6       trump       514    6     457   all
## 7         can       494    7     432   all
## 8        know       453    8     402   all
## 9        john       438    9     406   all
## 10        get       434   10     386   all
## 11 government       396   11     317   all
## 12   question       394   12     329   all
## 13   citizens       369   13     270   all
## 14         us       368   14     299   all
## 15       many       365   15     315   all
## 16      think       293   16     269   all
## 17       even       292   17     271   all
## 18    country       288   18     240   all
## 19    illegal       281   19     218   all
## 20     oliver       271   20     252   all
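
To make the ranking easier to read, the top terms can also be plotted. A sketch using ggplot2 (loaded with the tidyverse), assuming the term_freq object created above with its feature and frequency columns:

# bar chart of the 20 most frequent terms
term_freq %>% 
  head(20) %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() + # horizontal bars so the term labels stay readable
  labs(x = NULL, y = "Frequency",
       title = "Most frequent words in comments on \"The Census\"")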